<- c(23, 22, 35) numeric_vector
Vectors and data frame basics
Overview
In this module, we will address some basic building blocks of data frames in R. Having a basic understanding of objects, we can extend that to experience to some of the most important objects to understand when working with data, vectors and data frames. We will cover different examples of vector objects, their elements, and how they are organized into larger objects like data frames. Many functions we will use to clean and wrangle data will involve understanding and recognizing these two different object types so that you can apply a function correctly. For example, if a function accepts a vector in order to modify it, you cannot pass a data frame to it. You cannot try to brute force this approach and if you do, you will forever be frustrated. So, let’s jump deeper into some basic concepts.
Note: By design, concepts in modules will often be redundant with concepts in other modules. If you understand the concepts, just read along or skip through the content. The redundancy, however, is built in purposefully to provide additional scaffolding to those who need it and because droves or literature on cognition and memory support the importance of repetition.
Readings and Preparation
Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time.
Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.
Libraries
- {here} 1.0.1: for file path management
To Do
Read through the module. You can use the R
console or open up an R Markdown
(e.g., .Rmd
) file to follow along interactively. If you instead prefer to simply read through the content so that you can understand the concepts without coding, that is fine too. Concepts will be applied in class in order to complete activities, however. Reading the module will provide you with confidence working on those activities and prevent you from feeling lost while completing activities. Testing out some code may provide you more confidence.
Variables/Vectors are Uni-dimensional
Objects can take on different characteristics. For example, some objects may represent a single value (e.g., a numeric values or a string), some may represent multiple values like a variable or levels of a variable, some may be lists, etc.
In R
, the simplest type of data structure is known as a vector. A vector is basically a sequence of data elements of the same basic type (e.g., numeric, character, etc.). The values of a vector, whether numeric or non-numeric, are called elements or components. One way to conceptualize a vector is that it’s an object holding values of a variable. For example, you might have variables representing ages of all your participants, reaction times corresponding to response to stimuli, salaries of people in your company, education levels of employees, or IQs of individuals. In many instances, you may have variables that exist as part of a data file whereas in other instances, you may need to to create them with code. Let’s create some vectors to understand their structure.
Combining elements into vectors using c()
One way to create vectors is by using the combine function, c()
. You can combine numeric and character elements into a vector. Let’s create some examples and assign names to them.
Numeric vector:
Call the object:
numeric_vector
[1] 23 22 35
Notice that the 3 returned elements of the vector are numbers.
Character vector:
<- c("Salle", "Jane", "Beavis")
character_vector
character_vector
[1] "Salle" "Jane" "Beavis"
Now, the 3 returned elements are enclosed by quotation marks, "
. The quotes help you understand that the vector is character type.
Numeric values as a character:
<- c("23", "22", "35")
quote_num_vector
quote_num_vector
[1] "23" "22" "35"
Although the elements are numbers, they are in quotes, which indicates that the vector is character type.
Numbers and characters:
<- c(23, "22", "35")
num_char_vector
num_char_vector
[1] "23" "22" "35"
Although one of the elements of the vector is numeric, the entire vector is returned as character. This is an important characteristic of vectors in R. They can be numeric or character but not both. If there is even one number enclosed by quotes, the vector is character as seen here.
c(23, 22, 35, "30")
[1] "23" "22" "35" "30"
Creating vectors using rnorm()
Let’s use rnorm()
to create a numeric vector object that will represent sampling from a random normal distribution. The distribution will have a certain number of observations, n
, a mean, mean
, and a standard deviation, sd
. We will need to pass arguments to create the data.
Parameters/Arguments:
n
: the number of elements in the vectormean
: the mean of the elementssd
: the standard deviation of the elements
Let’s create a random normal distribution with a length of 1000 values. The mean should be 100 and standard deviation 15 (e.g., IQ distribution).
<- rnorm(n = 1000, mean = 100, sd = 15) iq
Remember, as long as the order is correct, you do not have to specify the parameter names. You only need to specify the arguments for the parameters.
<- rnorm(1000, 100, 15) iq
Look at the object’s head using head()
in order to inspect the first 6 elements:
head(iq)
[1] 79.66962 87.69924 89.07787 109.51183 99.03670 116.71565
Or inspect the first 10 elements by setting n = 10
.
head(iq, n = 10)
[1] 79.66962 87.69924 89.07787 109.51183 99.03670 116.71565 119.90918
[8] 97.07653 81.53505 89.99322
Creating Vectors as Repetitions or Sequences
Let’s say you new the first 500 people were female and other 500 were male. Create a vector and then create data frame. Two new functions are useful here: rep()
for creating replications and seq()
for creating sequences.
Combining 5 "F"
s is tedious to type
c("F", "F", "F", "F", "F")
[1] "F" "F" "F" "F" "F"
But we can replicate a string a certain number of times to be less redundant using rep()
.
rep("F", times = 5) # replicate object 5 times
[1] "F" "F" "F" "F" "F"
We can also replicate a vector that is created using c()
.
rep( c("F", "M"), 5) # replicate c() vector 5 times
[1] "F" "M" "F" "M" "F" "M" "F" "M" "F" "M"
Or by assigning the vector as an object first.
<- c("F", "M") # assign a two-element character vector
sex_levels
rep(sex_levels, 5) # replicate c() vector 5 times
[1] "F" "M" "F" "M" "F" "M" "F" "M" "F" "M"
In these examples, you see the "F"
and "M"
alternate for 5 replications.
We can also replicate using rep()
, setting each to 500, combine the repetitions using c()
and assign the to sex
.
<- c(rep("F", 500), rep("M", 500)) sex
This will return a vector that contains "F"
for the first 500 elements and "M"
for the remaining 500 elements. Let’s just inspect the first several elements using head()
to inspect the head of the vector.
head(sex)
[1] "F" "F" "F" "F" "F" "F"
And the tail of the vector:
tail(sex)
[1] "M" "M" "M" "M" "M" "M"
Alternatively, create two vector objects if that’s less confusing for you.
<- rep("F", 500)
f <- rep("M", 500) m
And then combine them and assign a name. Assigning sex
will overwrite the previous version of sex
, so do this only if you are sure you want to overwrite.
<- c(f, m) # assign the character vectors to the sex vector sex
head(sex)
[1] "F" "F" "F" "F" "F" "F"
Creating number sequences using seq()
You can dig a little more deeply into the functionality of seq()
by looking at the help documentation, help(seq)
or ?seq
.
Parameters/Arguments:
from, to
: the starting and (maximal) end values of the sequence. Of length 1 unless just from is supplied as an unnamed argument.by
: number: increment of the sequence.length.out
: desired length of the sequence. A non-negative number, which for seq and seq.int will be rounded up if fractional.along.with
: take the length from the length of this argument.
Sequences from
one value to
another
Use seq()
to create a sequence FROM 1 TO 10:
seq(from = 1, to = 10)
[1] 1 2 3 4 5 6 7 8 9 10
The help documentation will inform you that the from
and to
parameters are in the first and second position, respectively. We can specify their arguments without using the parameter explicitly.
Reference only to
:
seq(1, to = 10)
[1] 1 2 3 4 5 6 7 8 9 10
Or, remove both:
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
The returned result will be the same.
Sequences with increments using the by
argument
We can also modify the from
and to
step through using the by
function such that the numbers will increment by the value of by
.
seq(2, 10, by = 2) # FROM 2 TO 10 BY 2s, dropping from and to arguments
[1] 2 4 6 8 10
Using another example, create a sequence from 1 to 1000 and assign that to an object named id
.
<- seq(1, 1000) # assign a sequence of 1 to 1000 to a vector named id id
Inspecting the head
:
head(id)
[1] 1 2 3 4 5 6
Obtaining the number of elements using length()
If the number of elements in a vector variable changes, hard coding can be troublesome. We can return the length()
of the sex vector and pass that as the sequence value. This approach is useful for objects that get modified. This approach would be more flexible.
length(sex) # length will return the length of the vector, including NAs
[1] 1000
Great, we get 1000!
Creating a sequence using other functions
The argument for a function can also be another function. If we have the number of elements in a vector, we could also create a sequence from a starting number to the ending number as defined the another function.
Assign a sequence of 1 to the length()
of the sex
object and assign that to an object named id
. To make the code more easy to read, we will place the parameters on separate lines.
<- seq(from = 1,
id to = length(sex)
)
And look at the head:
head(id)
[1] 1 2 3 4 5 6
Vectors Can be Different Types
Common vectors types are numeric (integer and double/float) and character. You can check the type using typeof()
. If you were interested in knowing whether a vector is of a specific type, you can use a set of functions starting with is.
to ask whether the vector is a specific type (e.g., is.numeric() or is.double()
, is.integer()
, is.character()
, etc.). These functions will return a logical TRUE
or FALSE
.
is.double(iq) # is it a double precision/floating point?
[1] TRUE
is.numeric(iq) # is it a set of numeric values?
[1] TRUE
is.character(iq) # is it a set of characters?
[1] FALSE
is.integer(iq) # is it a set of integer values?
[1] FALSE
You can see that the values in iq
are numeric rather than character but are not integers.
Changing Vector Types
You can also change the form of vectors using a set of functions starting with as.
(e.g., as.integer()
, as.character()
, as.numeric()
). Because IQ scores are integers and not floats containing decimals, let’s just change the values created from numeric to an integer using as.integer()
.
Converting numeric or characters containing numbers to integer
Convert to integer:
<- as.integer(iq) iq
Now they are integers:
head(iq)
[1] 79 87 89 109 99 116
Converting numeric vectors to character
If the characters are numbers, pass an existing vector into as.character()
to convert
<- as.character(iq) # make is a character vector iq
Now the numbers are in quotes representing strings.
head(iq)
[1] "79" "87" "89" "109" "99" "116"
Converting character vectors to numeric
If the elements are characters, pass an existing vector into as.integer()
to convert.
<- as.integer(iq) iq
Now elements are integers, not floats.
head(iq)
[1] 79 87 89 109 99 116
You can also wrap a function in another function when the initial object is created. Here, as.integer()
converts the vector returned by rnorm()
into an integer.
<- as.integer(rnorm(1000, 100, 15)) # create initially by wrapping as.integer() around rnorm() iq
Using piping operators |>
and %>%
Alternatively, if nesting functions and reading them from the inside out confuses you, you can instead use piping operators to pass objects from function to function. Piping also improves code readability.
Base R
now includes it’s own piping operator, |>
. In the past, piping was accomplish using {magrittr} piping operators and sometimes those operators work better than the base R
pipe. The main piping operator from {magrittr} is %>%
. To use it, we will need to load the library first using library()
, though you should loads all libraries at the top of your file. We will use both piping operators to illustrate differences in functionality. I may use them interchangeably in materials.
Load the library:
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
Create the iq
vector:
<- rnorm(1000, 100, 15) # create the random normal dist iq
Check whether the vector is an integer:
is.integer(iq)
[1] FALSE
is.integer()
is a logical test, so it will return either TRUE
or FALSE
. You see that the vector is not an integer as FALSE
was returned. Because IQ scores are integers, let’s use the process to make it an integer.
<- rnorm(1000, 100, 15) %>% # create the random normal dist
iq as.integer() # pipe to make integer
Using base R
’s piping operator:
<- rnorm(1000, 100, 15) |> # create the random normal dist
iq as.integer() # pipe to make integer
Examine the head()
of the object
However you create the vector, we can examine the head of the vector using head()
:
head(iq)
[1] 131 101 106 118 124 99
A note about pipes: If you did not pipe the object from one function to another, you would need to wrap the functions as layers. The the function in the inner layer would be executed first and the function of the outer layer would be executed last.
as.integer(rnorm(1000, 100, 15))
Although there is nothing incorrect with this code, readability is compromised because you have to read the functions from the inside out. Piping objects to functions allows for reading functions from top to bottom.
Plotting the vector using hist()
And to see a plot, use hist(iq)
:
hist(iq)
Or pass the object using a pipe (e.g., |>
or %>%
) and pipe the vector to the histogram function in base R
:
%>% hist() iq
|> hist() iq
Notice that this approach changes the title. When piping objects with {magrittr}’s pipe, the object is referenced using .
so this because the name of the object.
Adding a title to the plot
TO add a title, pass a string argument to main
:
|> hist(main = "IQ Distribution") iq
Recoding vectors
Sometimes elements of vectors are messy and need to be fixed. In general, this process is referred to as recoding. There are various ways to recode in R
and there are special libraries dedicated to recoding. Now, we will perform some simple recoding.
You can use ifelse()
or dplyr::case_when()
to test whether the elements of a vector match some condition and if yes (do one thing), otherwise no (do something else). In order to understand how these functions and many others work we need to understand a logical test.
ifelse()
dplyr::case_when()
Performing a logical test
We can perform a logical test on all types of objects but this example will focus on vectors, for which all elements will be examined. The logical test will return TRUE
or FALSE
for each vector element depending on whether the case matches a test condition.
Rather than illustrate this the 1000 element sex
vector, we will create a vector of length 5 called temp_sex
. In this example, you will also see what happens when you need to clean up and recode the sloppy data. You can see are composed of both upper and lower case letters presumably corresponding to biological sex. Because R
is a case-sensitive language, "M"
and "m"
represent different objects even though the intent is for them to be the same.
<- c("M", "m", "F", "f", NA) temp_sex
You can see that some character elements in the vector are upper- and lowercase m’s and f’s. Let’s see what happens when we perform a logical test of the vector. Remember that by itself =
will be assignment (like <-
). In order to determine whether an object or elements of an object is equal to something, we use ==
.
For example, temp_sex == "M"
:
== "M" temp_sex
[1] TRUE FALSE FALSE FALSE NA
The returned vector is of the same length as the vector tested. Each element of the vector was tested and those elements that matched the character "M"
returns TRUE
. The other elements are FALSE
, except of the NA
, or missing value. Importantly, the logical test does not return FALSE
for NA
s.
Recoding using ifelse()
Let’s perform the logical test inside ifelse()
. The function will test each element and IF it is TRUE
(matches the condition) will assign 0
, else/otherwise assign 1
in order to recode the letters into numbers. Of course, you may often wish to recode letters into other letters.
ifelse(temp_sex == "M", 0, 1)
[1] 0 1 1 1 NA
We can see that all of the TRUE
s are 0
and all else except for NA
are 0
. You can see that only the "M"
was recoded to 0
and all other elements that were not NA
were recoded as 1
. Clearly, this is not correct. We can use multiple ifelse()
functions but the solution won’t be offered here because it’s just really messy. Instead of focusing on bad code, let’s just focus on offering better solutions.
Performing the same test on sex
but piping the vector to head()
will allow for inspection of the first few elements.
ifelse(sex == "M", 0, 1) |>
head()
[1] 1 1 1 1 1 1
The ifelse()
approach is fine for two groups. When there are more than two groups, however, re-coding can be confusing with ifelse()
because you’ll need to nest an ifelse()
inside an ifelse()
. The popular {dplyr} library also has a couple functions similar to ifelse()
, for example, if_else()
and case_when()
.
Recoding using dplyr::case_when()
The case_when()
function will also perform logical tests but you can easily specify more than one. The help documentation tells us that “This function allows you to vectorise multiple if_else()
statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used as a final”else” statement.” And later “If none of the cases match and no .default
is supplied, NA
is used.
With case_when()
, we will perform a logical test of the elements against the character string "M"
first. The elements that match "M"
will evaluate as TRUE
and will be recoded as 0
. Then the elements of the new recoded vector will be tested again against "F"
. Those that are TRUE
will be recoded as 1
. Anything else will be assigned NA
for missing because they matched neither "M"
nor "F"
.
We wee that only the first element returns TRUE
. Other elements are FALSE
or NA
.
::case_when(
dplyr== "M" ~ 0,
temp_sex == "F" ~ 1
temp_sex )
[1] 0 NA 1 NA NA
But note that if the character case is inconsistent, NA
’s will replace elements that are "m"
and "f"
. An easy fix for casing is to convert the vector (without assignment) by passing it to tolower()
or toupper()
and then perform the logical conversion on the case change elements.
::case_when(
dplyrtoupper(temp_sex) == "M" ~ 0,
toupper(temp_sex) == "F" ~ 1
)
[1] 0 0 1 1 NA
You can now see that all elements will evaluate to TRUE
and are recoded except for the NA
in the vector. Keep in mind that strings may be very messy and require careful inspection and cleaning. Sometimes simple case recoding solves your problems.
Note: There is a much powerful library called {stringr} for wrangling strings. We will use this library to perform other tasks. If you wanted to familiarize yourself with making strings upper or lowercase using that library, use stringr::str_to_lower()
and stringr::str_to_upper()
.
To illustrate:
library(stringr) # load the library first
::case_when(
dplyrstr_to_upper(temp_sex) == "M" ~ 0,
str_to_upper(temp_sex) == "F" ~ 1
)
[1] 0 0 1 1 NA
When you have more logical tests to perform, you can specify them on a new line. Using the existing vector, we can illustrate with a silly example.
::case_when(
dplyr== "m" ~ "Young Men",
temp_sex == "M" ~ "Old Men",
temp_sex == "f" ~ "Young Women",
temp_sex == "F" ~ "Old Women",
temp_sex )
[1] "Old Men" "Young Men" "Old Women" "Young Women" NA
We have not recoded the vector to have clear names for 4 groups, plus and NA
s. Of course, you have not changed temp_sex
unless you assign the returned vector to a name.
temp_sex
[1] "M" "m" "F" "f" NA
Assign to temp_sex
to overwrite:
<- dplyr::case_when(
temp_sex == "m" ~ "Young Men",
temp_sex == "M" ~ "Old Men",
temp_sex == "f" ~ "Young Women",
temp_sex == "F" ~ "Old Women",
temp_sex )
temp_sex
[1] "Old Men" "Young Men" "Old Women" "Young Women" NA
Vectors Contain Ordered Components/Elements
Because vectors are uni-dimensional and composed of components or elements, those components have an order or position within the vector. Some value has to be first, some value has to be last, and if there are more than 2, all other elements assume some ordered position between the two. You can examine their elements using []
. Behind the scene, this is what R is essentially doing when you call the iq
object but you can also specify an element’s position using a number or set of numbers.
Inspect all elements:
|> head() iq[]
[1] 131 101 106 118 124 99
The element in the 1000th position:
1000] iq[
[1] 101
Nothing is beyond the length of 1000 (e.g,. 1001):
1001] iq[
[1] NA
The first 5 positions:
1:5] iq[
[1] 131 101 106 118 124
Use :
use to find a range, for example the 100th through 105th positions:
100:105] iq[
[1] 110 124 99 114 112 91
A comma will not be understood by the interpreter:
iq[100,105] # will
R
will throw an error: incorrect number of dimensions
. If iq
was a data frame, the [,]
would be fine.
Use c()
to combine positions, for example, the 100th and the 105th only:
c(100,105)] iq[
[1] 110 91
Use c()
to “combine” different positions, like the 1st through 5th, 100th, 105th:
c(1:5, 100, 105)] iq[
[1] 131 101 106 118 124 110 91
Orders of Elements Can be Reordered
You might want or need to change the order of elements in vectors. One such change involves sorting with sort()
.
Sort from lowest to highest:
sort(iq) |> head()
[1] 52 54 57 58 60 62
The sort()
function has a parameter named decreasing
, which by default is set to FALSE
. In the help documentation you will see: decreasing = FALSE
. The behavior is to NOT sort in a decreasing manner. In order to reverse this behavior, we pass TRUE
.
Sort from highest to lowest:
sort(iq, decreasing = T) |> head()
[1] 149 143 143 142 141 137
You can also modify values of elements by referencing them by their index/position and assign that index a new value.
If for example, you found errors for certain positions, say c(1,52, 99, 108)
, you could set them to NA
.
Create a back for illustration:
<- iq iq_backup
The recorded values:
c(1,52, 99, 108)] iq[
[1] 131 91 113 107
Now make these elements NA
:
c(1,52, 99, 108)] <- NA iq[
Examine after reassignment:
c(1,52, 99, 108)] iq[
[1] NA NA NA NA
Restore iq
using the backup:
<- iq_backup iq
We will use {dplyr} when we clean up variables rather than use the bracket notation showed in these examples. However, understanding the behavior of vector objects in base R
is important both if you need to use it and in order to understand how other libraries may be working with R
more generally.
Removing an Object from Memory
Objects consume memory. Having active objects that you are not using can slow down your system but they can also just confuse you. Sometimes, you want to remove them. You can remove any object from memory by passing it to rm()
. Let’s remove the backup object because it’s no longer needed.
rm(iq_backup)
Data Frames are Two-Dimensional
A data frame is a two-dimensional structure in which each column contains values of one variable and each row contains one set of values from each column. Data frames are row and column objects. If you think of a data frame is a bunch of vectors, you can create a data frame from the vectors just created. Use data.frame()
.
Create a data frame with existing objects; must all be same length!:
<- data.frame(id, sex, iq) DF
Inspect the head:
head(DF)
id sex iq
1 1 F 131
2 2 F 101
3 3 F 106
4 4 F 118
5 5 F 124
6 6 F 99
Notice that column names of the data frame are inherited by the vector names. If you wanted to change the column names at creation of the data frame, just use name = vector
.
If you wanted to assign names to columns that differed from the name of the vectors, just assign using =
. Inside functions, use =
for assignment rather than <-
:
<- data.frame(Id = id, Sex = sex, Iq = iq) DF
But you could always just pass a character vector of the same length as names()
and assign them to the existing names but this requires an extra step.
What are the names of the columns in the data frame:
names(DF)
[1] "Id" "Sex" "Iq"
How many columns are there, use length()
:
length(names(DF))
[1] 3
Assign names using a vector of names:
names(DF) <- c("Id", "Sex", "Iq")
Inspect the names:
names(DF)
[1] "Id" "Sex" "Iq"
Note that the names are the same so nothing actually changes but we could have easily passed a different vector, c("Subid", "Sex", "IntelligenceQuotient")
.
Two-Dimensional Objects are Not Always Data Frames
As a word of caution, sometimes two-dimensional objects are not data frames. Use is.data.frame()
to perform a logical test. Logical tests will return TRUE
or FALSE
.
is.data.frame(DF)
[1] TRUE
A matrix with 2 dimensions:
<- matrix(1:12, nrow = 4, ncol = 3) matrix_aint_no_dataframe
Is it a data frame?
is.data.frame(matrix_aint_no_dataframe)
[1] FALSE
But it can be made into a data frame using as.data.frame()
:
as.data.frame(matrix_aint_no_dataframe)
V1 V2 V3
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
But won’t be changed unless you use assignment <-
:
matrix_aint_no_dataframe
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
Assign it to an object
<- as.data.frame(matrix_aint_no_dataframe) DF_was_a_matrix
Is it a data frame now?
is.data.frame(DF_was_a_matrix)
[1] TRUE
Examining Data Frames
There are a set of functions you can used to inspect a data frame. For example, you can look its row x column dimensions using dim()
or examine the structure of the vectors using str()
or even look at the top or bottom rows using head()
and tail()
.
Return the numeric vector of Row and Column counts:
dim(DF)
[1] 1000 3
Because there are two values, the row count is the first element:
dim(DF)[1]
[1] 1000
Because there are two values, the column count is the second element:
dim(DF)[2]
[1] 3
Return more information about each vector, use str()
:
str(DF)
'data.frame': 1000 obs. of 3 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Sex: chr "F" "F" "F" "F" ...
$ Iq : int 131 101 106 118 124 99 130 108 89 90 ...
Or dplyr::glimpse()
:
::glimpse(DF) dplyr
Rows: 1,000
Columns: 3
$ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,…
$ Sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "…
$ Iq <int> 131, 101, 106, 118, 124, 99, 130, 108, 89, 90, 80, 101, 82, 107, 1…
Note: dplyr::glimpse()
also provides the row and column counts.
Return the first 6 rows:
head(DF)
Id Sex Iq
1 1 F 131
2 2 F 101
3 3 F 106
4 4 F 118
5 5 F 124
6 6 F 99
Return the last 6 rows:
tail(DF)
Id Sex Iq
995 995 M 122
996 996 M 123
997 997 M 101
998 998 M 88
999 999 M 89
1000 1000 M 101
Data Frame Column/Variable Names
You can look at the names of the columns in your data frame using names()
:
names(DF)
[1] "Id" "Sex" "Iq"
Or use colnames()
:
colnames(DF)
[1] "Id" "Sex" "Iq"
Changing Order of Variables in a Data Frame
If you want to rearrange the order of columns, one easy way to do this is using dplyr::relocate()
. But rather than calling library::function()
, just load the library. The first parameter in relocate()
is .data
, representing the data frame. We will omit naming this parameter because we know we are passing a data frame.
dplyr::relocate()
: returns a new order
Load the library:
library(dplyr)
Take "Age"
and put after "Id"
using relocate()
and .after
:
|>
DF relocate("Sex",
.after = "Id"
|> # then pass to head()
) head()
Id Sex Iq
1 1 F 131
2 2 F 101
3 3 F 106
4 4 F 118
5 5 F 124
6 6 F 99
Take "Age"
and put after last column:
|>
DF relocate("Sex",
.after = last_col()
|>
) head()
Id Iq Sex
1 1 131 F
2 2 101 F
3 3 106 F
4 4 118 F
5 5 124 F
6 6 99 F
Take "Age"
and put before "Id"
:
|>
DF ::relocate("Sex",
dplyr.before = "Id"
|>
) head()
Sex Id Iq
1 F 1 131
2 F 2 101
3 F 3 106
4 F 4 118
5 F 5 124
6 F 6 99
Move all numeric variables left:
|>
DF relocate(where(is.numeric)) |>
head()
Id Iq Sex
1 1 131 F
2 2 101 F
3 3 106 F
4 4 118 F
5 5 124 F
6 6 99 F
Move all numeric columns to a location:
|>
DF relocate(where(is.numeric),
.after = where(is.character)
|>
) head()
Sex Id Iq
1 F 1 131
2 F 2 101
3 F 3 106
4 F 4 118
5 F 5 124
6 F 6 99
Note that if you want to change the order of columns in the data frame, you will need to use assignment to replace the old object with the new one with a new order.
Referencing Vectors In a Data Frame Using $
Because the data frame contains vectors of certain names, the vectors in the data frame do not exist outside of the data frame object. In order to reference them, you can use the $
.
Iq
`Error: object 'IQ' not found`
But the vector does exist inside the data frame. Using the $
will allow you to specify the vector in a data frame: dataframename$vector
.
head(DF$Iq)
[1] 131 101 106 118 124 99
Inspecting Dataframe Objects using Bracket Notation []
Data frames are row and column structures, so you can inspect its elements using []
as you did to inspect vectors. However, because data frames contain both rows and columns, []
operates differently on them than on one-dimension vectors.
The 1000th element of the vector:
1000] iq[
[1] 101
Remember, the vector in the data frame is a vector too:
$Iq[1000] DF
[1] 101
First 5:
$Iq[1:5] DF
[1] 131 101 106 118 124
Note: dplyr::slice()
can do a lot of the same as the above but more about that later.
Parsing a Data Frame by It’s Rows and Columns
When examining a data frame you specify the rows and columns before and after a comma. You can pass numbers to represent the row number and or the column number or pass no numbers to see all rows and all columns (e.g., : [,]
).
|> head() # All rows and all columns -- same as DF or DF[] DF[,]
Id Sex Iq
1 1 F 131
2 2 F 101
3 3 F 106
4 4 F 118
5 5 F 124
6 6 F 99
1000, 3] # the 1000th row, 3rd column (see above for str() or dim() ) DF[
[1] 101
All rows and the Iq
column ONLY:
"Iq"] |> head() DF[,
[1] 131 101 106 118 124 99
The 1000th row and the Iq
column ONLY:
1000, "Iq"] DF[
[1] 101
Want more than one column?
DF[ 1000, "Sex" "Iq" ]
Error: unexpected string constant in "DF[ 1000, "Sex" "Iq""
Use c()
to combine them and then pass the combined names as the argument:
c("Sex", "Iq")
[1] "Sex" "Iq"
1 through 10th row and BOTH Sex and IQ columns:
1:10, c("Sex", "Iq") ] DF[
Sex Iq
1 F 131
2 F 101
3 F 106
4 F 118
5 F 124
6 F 99
7 F 130
8 F 108
9 F 89
10 F 90
Modifying a Data Frame Using base R
There are ways to modify a data frame using base R or by using dplyr
.
Using base R, you can add or modify variables in a data frame. Be careful not to overwrite an existing variables unless that is your intention.
Assign a New Variable using $
.
Add the year as a string or character vector; assigns same string to all Rows
$New <- "2021" DF
head(DF$New)
[1] "2021" "2021" "2021" "2021" "2021" "2021"
Add the year as a numeric vector; assigns same value to all rows:
$New2 <- 2021 DF
head(DF$New2)
[1] 2021 2021 2021 2021 2021 2021
Using the $
can sometimes be clunky. But you can also assign a new variable using []
.
Add the year as a string or character vector; assigns same string to all Rows
"New"] <- "2021"
DF[,
head(DF$New)
[1] "2021" "2021" "2021" "2021" "2021" "2021"
Add the year as a numeric vector; assigns same value to all rows:
"New2"] <- 2021
DF[,
head(DF$New)
[1] "2021" "2021" "2021" "2021" "2021" "2021"
Modifying an Existing Variable in a Data Frame Using base R
You have already done this above. Whenever you assign a vector to an existing column of a data frame, you will overwrite it.
names(DF) # returns the names of DF. New and New2 are there
[1] "Id" "Sex" "Iq" "New" "New2"
$New <- 1 # or DF[, "New] <- 1 DF
head(DF$New)
[1] 1 1 1 1 1 1
Rename an Existing Variable in a Data Frame Using base R
There are various ways to rename column variables. You can assign an existing vector to a new column and simply remove the old column, you can change the name manually, and you can use libraries.
Find the column number and assign a new string to it.
names(DF) # New looks like position 4
[1] "Id" "Sex" "Iq" "New" "New2"
Is the name in the 4th position?
names(DF)[4]
[1] "New"
Can we rename it?
names(DF)[4] <- "NewName"
Position 4 now has a new name.
names(DF)
[1] "Id" "Sex" "Iq" "NewName" "New2"
Assign an existing column vector to a new column vector, then reassign the data frame by eliminating the old one.
Assign NewName to NewName2:
$NewName2 <- DF$NewName DF
Check the column names. NewName2
is there.
names(DF)
[1] "Id" "Sex" "Iq" "NewName" "New2" "NewName2"
Use c()
to combine the column names you want to keep.
c("Id", "Sex", "Iq", "NewName2") ] |> head() DF[,
Id Sex Iq NewName2
1 1 F 131 1
2 2 F 101 1
3 3 F 106 1
4 4 F 118 1
5 5 F 124 1
6 6 F 99 1
Assign this new subsetted data frame to DF
.
<- DF[, c("Id", "Sex", "Iq", "NewName2") ] DF
Now gone.
names(DF)
[1] "Id" "Sex" "Iq" "NewName2"
Examine the Levels of a Variable in a Data Frame
Levels are for factor variables, so you need to check whether you are dealing with a factor if you wanted the variable to be a factor.
levels(DF$Sex) # list levels of variable in DF. Retuns NULL if it's not a factor
NULL
head(DF$Sex)
[1] "F" "F" "F" "F" "F" "F"
Is it a factor?
is.factor(DF$Sex) # nope
[1] FALSE
Reassigning the vector as a factor. Use as.factor()
to wrap the object and then assign to existing.
$Sex <- as.factor(DF$Sex)
DF
is.factor(DF$Sex) # Now it is
[1] TRUE
head(DF$Sex) # looks different
[1] F F F F F F
Levels: F M
levels(DF$Sex) # Has 2 levels
[1] "F" "M"
Modifying/Creating Variables in a Data Frame Using dplyr
The mutate()
function will allow for creating new variables or changing existing variables. Similar to overwriting variable names with mutate()
, rename()
will rename them. Passing the data frame with %>%
or |>
making the code easy to follow.
Single instance
head(DF)
Id Sex Iq NewName2
1 1 F 131 1
2 2 F 101 1
3 3 F 106 1
4 4 F 118 1
5 5 F 124 1
6 6 F 99 1
str(DF)
'data.frame': 1000 obs. of 4 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Iq : int 131 101 106 118 124 99 130 108 89 90 ...
$ NewName2: num 1 1 1 1 1 1 1 1 1 1 ...
|>
DF ::mutate(Sex_f = as.factor(Sex), # convert the character to a factor
dplyrNew_var1 = 0, # set to some constant
New_var2 = Iq/2, # using math
New_var3 = dplyr::case_when( # conditional...
<= 80 ~ 1,
Iq > 80 & Iq <= 115 ~ 2,
Iq > 115 ~ 3,
Iq
),New_var4 = "Exp 1", # a constant character
New_var5 = dplyr::ntile(Iq, 5) # quintiles based on specific variable
|>
) ::mutate(
dplyrNew_var4 = paste(New_var4, "*", sep = "") # mutate() to change a variable too
|>
) ::rename( Pid = Id) |> # then rename Id to Pid
dplyrhead()
Pid Sex Iq NewName2 Sex_f New_var1 New_var2 New_var3 New_var4 New_var5
1 1 F 131 1 F 0 65.5 3 Exp 1* 5
2 2 F 101 1 F 0 50.5 2 Exp 1* 3
3 3 F 106 1 F 0 53.0 2 Exp 1* 4
4 4 F 118 1 F 0 59.0 3 Exp 1* 5
5 5 F 124 1 F 0 62.0 3 Exp 1* 5
6 6 F 99 1 F 0 49.5 2 Exp 1* 3
Multiple instances
When you wish to perform the same type of function or operation across a set of variables, mutating each individually is unnecessary unless you like working harder and not smarter. For such cases, dplyr::across()
will serve you well. However, given each new variable needs it’s own name (if it’s not obvious, column variables cannot be redundant), you’ll need to pass some special code as the argument for .names
so that your variable names are unique and are meaningful to your goal.
Note that the argument for .names
operates in way a similar to how paste()
concatenates character strings. The new names will be the result of gluing together the column names by passing {.col}
along with other characters.
A couple key arguments will need to be passed for .cols
, .fns
, and .names
.
across(.cols = the_columns, .fns = the_funtion_to_apply, .names = the_new_var_names)
The examples below use across()
within mutate()
but you can also pair it with summarise()
. Here are some examples for creating z-scores for numeric variables by passing variables to .cols
by their names with a character vector using c()
, by their starting characters using starts_with()
, by their contained characters using contains()
, and by their numeric type using where()
.
|>
DF # across variables in character vector
mutate( across(.cols = c("Iq"), # take the Iq col
.fns = scale, # scale function
.names = "z_{.col}") # to new name 'z_' + 'Iq'
|>
)
# across variables with character match starting with
mutate( across(starts_with("Iq"),
scale, .names = "zIq_{.col}")
|>
)
# across variable with characters contained in
mutate( across(contains("i"),
scale, .names = "zi_{.col}")
|>
)
# across all numeric types
mutate( across(where(is.numeric),
scale, .names = "znum_{.col}")
|>
) head()
Id Sex Iq NewName2 z_Iq zIq_Iq zi_Id zi_Iq zi_z_Iq
1 1 F 131 1 2.12359318 2.12359318 -1.729454 2.12359318 2.12359318
2 2 F 101 1 0.12202871 0.12202871 -1.725992 0.12202871 0.12202871
3 3 F 106 1 0.45562279 0.45562279 -1.722530 0.45562279 0.45562279
4 4 F 118 1 1.25624858 1.25624858 -1.719067 1.25624858 1.25624858
5 5 F 124 1 1.65656147 1.65656147 -1.715605 1.65656147 1.65656147
6 6 F 99 1 -0.01140892 -0.01140892 -1.712142 -0.01140892 -0.01140892
zi_zIq_Iq znum_Id znum_Iq znum_NewName2 znum_z_Iq znum_zIq_Iq
1 2.12359318 -1.729454 2.12359318 NaN 2.12359318 2.12359318
2 0.12202871 -1.725992 0.12202871 NaN 0.12202871 0.12202871
3 0.45562279 -1.722530 0.45562279 NaN 0.45562279 0.45562279
4 1.25624858 -1.719067 1.25624858 NaN 1.25624858 1.25624858
5 1.65656147 -1.715605 1.65656147 NaN 1.65656147 1.65656147
6 -0.01140892 -1.712142 -0.01140892 NaN -0.01140892 -0.01140892
znum_zi_Id znum_zi_Iq znum_zi_z_Iq znum_zi_zIq_Iq
1 -1.729454 2.12359318 2.12359318 2.12359318
2 -1.725992 0.12202871 0.12202871 0.12202871
3 -1.722530 0.45562279 0.45562279 0.45562279
4 -1.719067 1.25624858 1.25624858 1.25624858
5 -1.715605 1.65656147 1.65656147 1.65656147
6 -1.712142 -0.01140892 -0.01140892 -0.01140892
If your data frame or tibble contains only numeric variables, however, you can use mutate_all()
. Similarly, you can subset your data frame to include only the numeric variable and use mutate_all()
but the previous function approaches would likely be better to use if you want the other variables retained in the data frame.
Here are two examples of sub-setting and then creating by selecting using select_if(is.numeric)
or select(where(is.numeric))
:
select_if(is.numeric)
: selects columns IF they are numericselect(where(is.numeric))
: selects columns where there are numeric variables
|>
DF select_if(is.numeric) |> # if type function
select(-c("Id", "NewName2")) |> # select out
mutate_all(~scale(.x)) |>
head()
Iq
1 2.12359318
2 0.12202871
3 0.45562279
4 1.25624858
5 1.65656147
6 -0.01140892
|>
DF select(where(is.numeric)) |> # where there are numerics
select(-c("Id", "NewName2")) |> # select out
mutate_all(~scale(.)) |>
head()
Iq
1 2.12359318
2 0.12202871
3 0.45562279
4 1.25624858
5 1.65656147
6 -0.01140892
Arranging/Sorting a Data Frame using dplyr
For arranging using dplyr
, you will use arrange()
and at very least pass arguments for the data frame (first) followed by the variables (what the documentation refers to as ...
. You can add other arguments as well, however, but R will need to know what to arrange and how to arrange it.
arrange(.data, ..., .by_group = FALSE)
Note, the variables can be represented as separated arguments in the function separated by commas or they can be passed as a single argument as a single vector containing the variable.
Examples:
arrange(var1, var2)
: the variables by their namesarrange_at(c("var1", "var2"))
: the variables as a combined vector
By default, arrange()
will sort in a ascending manner.
|>
DF arrange(Iq) |>
head()
Id Sex Iq NewName2
1 475 F 52 1
2 946 M 54 1
3 778 M 57 1
4 279 F 58 1
5 355 F 60 1
6 46 F 62 1
To sort in a descending manner, wrap the variable in desc()
.
|>
DF arrange( desc(Iq)) |>
head()
Id Sex Iq NewName2
1 630 M 149 1
2 196 F 143 1
3 638 M 143 1
4 864 M 142 1
5 166 F 141 1
6 583 M 137 1
|>
DF arrange( desc(Sex)) |>
head()
Id Sex Iq NewName2
1 501 M 117 1
2 502 M 94 1
3 503 M 104 1
4 504 M 120 1
5 505 M 103 1
6 506 M 84 1
Sorting on multiple variables…
|>
DF arrange( Sex, Iq) |>
view_html( show = 10)
Arrange based on variables containing "se"
(e.g., Sex):
|>
DF arrange(across(contains("Se"))) |>
view_html(show = 10)
If you want to pass a vector of names, you’ll need to use an unquote
operator to unquote the vector object so that dplyr understands it. A double bang !!
will unquote one character vector and a triple bang !!!
will unquote more.
|>
DF arrange(!!! rlang::syms(c("Sex", "Iq"))) |>
view_html( show = 10)
Or:
<- c("Sex", "Iq")
vars
|>
DF arrange(!!! rlang::syms(vars)) |>
view_html( show = 10)
However, this is confusing. There is an old function named arrange_at()
which makes this easier. As of now, the function is not deprecated and there is no plan to remove it from dplyr
. All you need to do is pass the vector object and the data frame will be sorted in the indexed order.
|>
DF arrange_at(vars) |>
view_html(show = 10)
Flagging Complete (or missing) Cases Across a Variable Group
If you want to find out quickly whether a case/row contains completed cases for a certain grouping of variables, you can subset the variables by name and create a new variable and assign a logical TRUE
or FALSE
.
|>
DF mutate(complete_all = complete.cases(across(everything()))) |> # across all variables
mutate(complete_i = complete.cases(across(contains("i")))) |> # those containing
mutate(complete_num = complete.cases(across(where(is.numeric)))) |> # those numeric
head()
Id Sex Iq NewName2 complete_all complete_i complete_num
1 1 F 131 1 TRUE TRUE TRUE
2 2 F 101 1 TRUE TRUE TRUE
3 3 F 106 1 TRUE TRUE TRUE
4 4 F 118 1 TRUE TRUE TRUE
5 5 F 124 1 TRUE TRUE TRUE
6 6 F 99 1 TRUE TRUE TRUE
Session Information
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_2.0.3 htmltools_0.5.8.1 DT_0.33 vroom_1.6.5
[5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[13] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.4
[5] hms_1.1.3 digest_0.6.36 evaluate_0.24.0 grid_4.4.1
[9] timechange_0.3.0 fastmap_1.2.0 R.oo_1.26.0 rprojroot_2.0.4
[13] jsonlite_1.8.8 R.utils_2.12.3 fansi_1.0.6 crosstalk_1.2.1
[17] scales_1.3.0 jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
[21] crayon_1.5.3 R.methodsS3_1.8.2 bit64_4.0.5 munsell_0.5.1
[25] cachem_1.1.0 withr_3.0.1 yaml_2.3.10 tools_4.4.1
[29] tzdb_0.4.0 colorspace_2.1-0 here_1.0.1 vctrs_0.6.5
[33] R6_2.5.1 lifecycle_1.0.4 htmlwidgets_1.6.4 bit_4.0.5
[37] pkgconfig_2.0.3 bslib_0.8.0 pillar_1.9.0 gtable_0.3.5
[41] glue_1.7.0 xfun_0.45 tidyselect_1.2.1 rstudioapi_0.16.0
[45] knitr_1.47 rmarkdown_2.27 compiler_4.4.1